Introduction

Evaluation of bone marrow fibrosis is essential in the assessment of newly diagnosed hematological malignancies and important for monitoring disease progression and therapeutic response, particularly in myeloproliferative neoplasms (MPNs). We previously developed an AI-based model, known as Continuous Indexing of Fibrosis (CIF), that can robustly quantitate marrow fibrosis, capturing fibrosis severity and heterogeneity beyond the limits of conventional manual histological assessment. This tool not only supports the diagnosis and classification of MPNs but has also shown significant potential for detecting early disease progression and enhancing the evaluation of novel therapeutics targeting myelofibrosis.

In this study we exhaustively evaluate the performance and utility of our research-grade CIF model using a large cohort of clinical bone marrow trephine (BMT) samples from a regional referral centre in the UK. In parallel, we assess its impact on fibrosis grading by a panel of international expert hematopathologists from several leading diagnostic centres.

Method

CIF ranks predefined tile-level features from reticulin-stained whole slide images (WSIs) of BMTs on a normalized 0-1 scale, with each tile's rank defining its CIF score. The model outputs a range of values, including average sample score and sample heterogeneity. Tile scores were visualized as heatmaps overlaid on reticulin-stained WSIs during validation, enabling highly intuitive model interpretation.
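As a rough illustration of how tile-level scores might be condensed into the two sample-level outputs named above, the sketch below takes a list of per-tile scores already on the 0-1 scale and returns their mean and spread. It is not the CIF implementation: the function name `summarize_tile_scores` is hypothetical, and using the population standard deviation as the heterogeneity measure is an assumption, since the abstract does not state how heterogeneity is computed.

```python
from statistics import fmean, pstdev


def summarize_tile_scores(tile_scores):
    """Return (mean score, heterogeneity) for one sample's tile scores.

    Assumption: heterogeneity is taken as the population standard
    deviation of the tile scores; the source does not define it.
    """
    if not tile_scores:
        raise ValueError("at least one tile score is required")
    if any(not 0.0 <= s <= 1.0 for s in tile_scores):
        raise ValueError("tile scores are expected on the 0-1 scale")
    return fmean(tile_scores), pstdev(tile_scores)


# Hypothetical tile scores for a single whole slide image.
mean_score, heterogeneity = summarize_tile_scores([0.2, 0.4, 0.6, 0.8])
print(f"mean={mean_score:.3f} heterogeneity={heterogeneity:.3f}")
```

Two samples with the same mean score can differ widely in spread, which is why a heterogeneity value is reported alongside the average.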

To assess performance at scale with real-world clinical data, we analyzed 1000 sequential BMT WSIs (April 2023 - July 2024) from the digital archive of Oxford University Hospitals NHS Foundation Trust. Diagnoses included myeloid (62%), non-myeloid (12%), no evidence of hematological malignancy (20%) and suboptimal samples (7%). A panel of 14 international hematopathologists independently reviewed all WSIs, with 836 (84%) flagged as being of sufficient quality to grade fibrosis. Employing a crossover study design with embedded washout periods, we assessed both intra- and inter-observer variability in fibrosis evaluation, with and without access to the CIF heatmap overlays.

Results

Although trained on normal/reactive and MPN samples, CIF successfully mapped fibrosis scores in all cases deemed interpretable by our panel, regardless of diagnosis, staining variation or tissue processing quality.

Overall intra-rater agreement for WHO myelofibrosis grading among the expert hematopathologists was only 66.3% when reticulin-stained WSIs were re-evaluated without algorithmic support. Inter-rater agreement, assessed with quadratic-weighted Cohen's kappa, ranged from 0.51 to 0.84 (median 0.65), highlighting the inconsistency in fibrosis grading in real-world practice. These findings challenge historical reports of high concordance when grading is performed by experienced hematopathologists.
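For readers unfamiliar with the agreement statistic, the following is a minimal sketch of quadratic-weighted Cohen's kappa for two raters assigning ordinal grades. It assumes grades are coded as the integers 0 to 3 (four levels, matching WHO MF-0 to MF-3); the function name and the example gradings are hypothetical and are not data from this study.

```python
from collections import Counter


def quadratic_weighted_kappa(rater_a, rater_b, n_grades=4):
    """Quadratic-weighted Cohen's kappa for two raters.

    Grades are assumed to be integers in range(n_grades).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must grade the same non-empty set of cases")
    n = len(rater_a)
    observed = Counter(zip(rater_a, rater_b))  # joint grade counts
    marg_a = Counter(rater_a)                  # rater A marginal counts
    marg_b = Counter(rater_b)                  # rater B marginal counts
    max_dist = (n_grades - 1) ** 2
    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in range(n_grades):
        for j in range(n_grades):
            weight = (i - j) ** 2 / max_dist   # penalty grows with squared distance
            obs_disagreement += weight * observed[(i, j)] / n
            exp_disagreement += weight * marg_a[i] * marg_b[j] / n ** 2
    if exp_disagreement == 0:
        raise ValueError("kappa is undefined when both raters use a single identical grade")
    return 1.0 - obs_disagreement / exp_disagreement


# Hypothetical gradings of eight cases by two pathologists.
a = [0, 1, 1, 2, 2, 3, 3, 0]
b = [0, 1, 2, 2, 3, 3, 2, 0]
print(round(quadratic_weighted_kappa(a, b), 3))
```

Because the weights grow with the squared distance between grades, a disagreement of MF-0 versus MF-3 is penalized far more heavily than MF-1 versus MF-2, which suits an ordinal grading scale.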

CIF heatmap augmentation yielded a small but statistically significant improvement in inter-rater agreement: Cohen's kappa increased to 0.60-0.84 (Wilcoxon signed-rank test, p=0.0017). Consensus on MF grade improved with CIF support (odds ratio 1.43, Wald test, p=0.0001) and crucially, individual pathologist grading aligned more closely with the expert consensus (odds ratio 1.20, Chi-squared test, p=0.0013).

Conclusion

This study validates the CIF model for quantitative fibrosis assessment using real-world clinical data. CIF performed robustly across a broad range of diagnoses and sample qualities, and heatmap overlays significantly improved inter-observer consensus among expert hematopathologists with minimal prior AI experience.

Beyond showcasing our model's utility, these findings expose the limitations of manual fibrosis grading, with important implications for diagnosis, prognostication, treatment eligibility, and the interpretability of clinical trial data.

To our knowledge, this represents the largest real-world evaluation of a deep learning-based approach for BMT analysis in hematopathology. These findings lay the foundation for prospective evaluation of CIF model integration into routine diagnostic workflows, with the aim of systematically improving the accuracy and reproducibility of fibrosis grading. This work advances hematopathology towards the standards of objectivity and consistency demanded in the era of precision medicine, with the potential to deliver wide-ranging benefits for patient care and outcomes in MPNs.
